TL;DR: GPT-2's head 1.5 directs attention to semantically similar tokens and actively suppresses self-attention. The mechanism is driven by a symmetric bilinear form whose negative eigenvalues enable the suppression. The head computes attention purely from token identity, independent of position. We cluster tokens semantically, interpret the weights to explain the attention scores, and steer self-suppression by tuning the eigenvalues.
A lower score means a better match. As expected, a survey of all heads in the model shows that head 1.5 is an outlier with a uniquely low score, confirming that it is specialized for this task.
Acknowledgment: I did this work during my time at ARENA.